Introduction

We found the Forest Covertype dataset in the UCI Machine Learning Repository; it contains forestry data from the Roosevelt National Forest in northern Colorado. The observations are taken from 30m by 30m patches of forest, each classified as one of seven forest cover types:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas-fir
  7. Krummholz

The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Kaggle hosted the dataset in a competition with a training set of 15,120 observations and a test set of 565,892 observations. The relative sizes of the training and test sets make classification of cover type a challenging problem. We decided to use the machine learning and visualization packages available in R for this project.

Data Exploration

| Name | Measurement | Description |
|------|-------------|-------------|
| Elevation | meters | Elevation in meters |
| Aspect | degrees azimuth | Aspect in degrees azimuth |
| Slope | degrees | Slope in degrees |
| Horizontal Distance To Hydrology | meters | Horizontal distance to nearest surface water features |
| Vertical Distance To Hydrology | meters | Vertical distance to nearest surface water features |
| Horizontal Distance To Roadways | meters | Horizontal distance to nearest roadway |
| Hillshade 9am | 0 to 255 index | Hillshade index at 9am, summer solstice |
| Hillshade Noon | 0 to 255 index | Hillshade index at noon, summer solstice |
| Hillshade 3pm | 0 to 255 index | Hillshade index at 3pm, summer solstice |
| Horizontal Distance To Fire Points | meters | Horizontal distance to nearest wildfire ignition points |
| Wilderness Area (4 binary columns) | 0 (absence) or 1 (presence) | Wilderness area designation |
| Soil Type (40 binary columns) | 0 (absence) or 1 (presence) | Soil type designation |
| Cover Type | Classes 1 to 7 | Forest cover type designation (response variable) |

Some class separation is clearly visible in the following plots of elevation.

Another compelling variable is aspect, the compass direction in which the slope descends most steeply. For example, in the rose diagram below, there are more Douglas-fir trees for observations with northern aspects (near 0º) than southern aspects (near 180º).

Across wilderness areas, there also seems to be some class separation: Cottonwood/Willow cover is found only in the Cache la Poudre area, and the Neota area comprises only Spruce/Fir, Lodgepole Pine, and Krummholz covers.

One of the clearer relationships in the dataset was the distribution of Hillshade luminance. Hillshade is measured on a scale from 0 to 255 (dark to bright).

Hillshade at a given time varies as: \[\cos(\text{Slope})\cos(90^\circ - \text{Altitude}) + \sin(\text{Slope})\sin(90^\circ - \text{Altitude})\cos(\text{Azimuth}-\text{Aspect})\] where Altitude is the angle of the Sun above the horizon and Azimuth is the compass direction of the Sun. An azimuth of 90 degrees corresponds to east.

This equation arises from a theorem in spherical geometry known as the spherical law of cosines, which relates the sides and angles of triangles constructed on spherical surfaces.


On a unit sphere, the side lengths a, b, c of a triangle equal the angles those sides subtend at the center of the sphere. If we know the two sides a and b and the angle C between them, then the cosine of the third side c is given by:

\[\cos c = \cos a \cos b + \sin a \sin b \cos C\]

In short, the illumination of a patch depends on the altitude of the Sun, the slope of the terrain, and the relative direction of the Sun and the slope.
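The formula above can be evaluated directly. Below is a minimal base-R sketch; the function name and the clamping of negative illumination (self-shadowed patches) to 0 are our own choices for illustration, not part of the dataset's documentation:

```r
# Hillshade index (0-255) from slope, aspect, and Sun position, all in degrees,
# following the formula quoted above.
hillshade <- function(slope, aspect, altitude, azimuth) {
  deg2rad <- function(d) d * pi / 180
  zenith <- deg2rad(90 - altitude)           # Sun angle measured from vertical
  s <- deg2rad(slope)
  illum <- cos(s) * cos(zenith) +
           sin(s) * sin(zenith) * cos(deg2rad(azimuth) - deg2rad(aspect))
  round(255 * pmax(illum, 0))                # clamp shadowed patches to 0
}

# Flat ground under an overhead Sun is fully lit:
hillshade(slope = 0, aspect = 0, altitude = 90, azimuth = 180)  # 255
```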

\(Hillshade(t_1,t_2,t_3)\) has been plotted below in three dimensions. Click/drag on the plot to explore:

More importantly, in the context of class separation, the following plot of elevation, slope, and hillshade clearly separates the forest cover types, suggesting that elevation may be the most significant factor in determining cover type.

The primary reason for collecting this cartographic data is terrain mapping; it is ultimately useful for applying topographic correction to satellite images in remote sensing, or as background information in scientific studies.

Topographic correction is necessary if, for example, we wish to identify materials on the Earth’s surface by deriving empirical spectral signatures, or to compare images taken at different times with different Sun and satellite positions and angles. By applying the corrections, it is possible to transform the satellite-derived reflectance into its true reflectivity or radiance under horizontal conditions.

Modeling

Logistic Regression

We first investigated multinomial logistic regression. Since classical multiple regression is sensitive to high variance of coefficient estimates when predictor variables are correlated, we decided to use the glmnet package, which fits logistic regressions with regularization. Given a predictor vector x, the probability of response class k (of 1 through 7) is the following:

\[\mbox{Pr}(G=k|X=x)=\frac{e^{\beta_{0k}+\beta_k^Tx}}{\sum_{\ell=1}^7e^{\beta_{0\ell}+\beta_\ell^Tx}}\]
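As a quick sanity check on this form: the class probabilities are a softmax over the seven linear scores \(\eta_k = \beta_{0k} + \beta_k^Tx\), so they are always positive and sum to one. A toy base-R computation (the scores below are made up purely for illustration):

```r
# Made-up linear scores eta_k = beta_0k + beta_k' x for the 7 cover types
eta <- c(1.2, 0.4, -0.3, 0.0, 0.8, -1.1, 0.5)

# Softmax: exponentiate and normalize
p <- exp(eta) / sum(exp(eta))

sum(p)  # probabilities sum to 1 regardless of the scores
```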

We then plug this into the elastic-net penalized negative log-likelihood function (the negative log-likelihood is the first bracketed term of the following equation):

\[\ell(\{\beta_{0k},\beta_{k}\}_1^7) = -\left[\frac{1}{N} \sum_{i=1}^N \Big(\sum_{k=1}^7y_{ik} (\beta_{0k} + x_i^T \beta_k)- \log \big(\sum_{k=1}^7 e^{\beta_{0k}+x_i^T \beta_k}\big)\Big)\right] +\lambda \left[ (1-\alpha)||\beta||_F^2/2 + \alpha\sum_{j=1}^p||\beta_j||\right]\]

where k indexes the response class (cover type 1 to 7) and N is the number of observations. The second term in the equation is the regularization penalty.

The elastic-net penalty \(\alpha\) varies from 0 to 1. When \(\alpha = 0\), the function reduces to Ridge regularization. When \(\alpha = 1\), Lasso regularization is used. The Ridge penalty shrinks the coefficients of closely correlated predictors toward each other, whereas the Lasso tends to pick one and discard the others.
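In glmnet this whole family is controlled by the alpha argument of cv.glmnet. Below is a minimal sketch using simulated stand-in data (the real runs used the 54 cartographic predictors; requires the glmnet package):

```r
library(glmnet)
set.seed(1)

# Toy stand-in for the predictors and the 7 cover-type classes
x <- matrix(rnorm(700 * 10), ncol = 10)
y <- factor(sample(1:7, 700, replace = TRUE))

# 10-fold CV over a grid of lambda values;
# alpha = 0 is ridge, alpha = 1 is lasso, values in between are elastic net
cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 0, nfolds = 10)

# Predict classes at the CV-error-minimizing lambda
pred <- predict(cvfit, newx = x, s = "lambda.min", type = "class")
```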

Ridge Regularization

We performed 10-fold cross validation on a ridge regularization over a grid of 100 values of \(\lambda\), choosing the \(\lambda\) with minimum cross-validation error for each model. When we ran the regression with a 70-30 cross validation split on the training set, we got an accuracy on the Kaggle test set of 0.5570.

When we reran the ridge regression with an 85-15 split on the training data, we got a Kaggle accuracy of 0.5956. Since the test set is so much larger than our training set, it was beneficial to use a larger proportion of the training data for learning before running the model on the larger test set.

Lasso Regularization

A 10-fold cross-validated lasso-regularized logistic regression earned a Kaggle score of 0.59594 with a 70-30 split and 0.59526 with an 85-15 split, using the best lambda in each case. Taking the lowest 50th percentile of lambdas on the 85-15 split and calculating the mode of their predictions scored 0.59443.
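The "mode of the predictions" step is a simple row-wise majority vote across the per-lambda class predictions. A base-R sketch with a toy prediction matrix standing in for the real one:

```r
set.seed(1)

# Toy stand-in: 5 observations, one column of class predictions per lambda
preds <- matrix(sample(as.character(1:7), 5 * 9, replace = TRUE), nrow = 5)

# Mode of one row of predictions (ties broken by first maximum)
row_mode <- function(row) names(which.max(table(row)))

# Final prediction per observation is the most common class across lambdas
final <- apply(preds, 1, row_mode)
```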

Elastic-net Regularization

Using both the ridge and lasso regularization terms concurrently produces an elastic-net regularization. We ran this hybrid method with both a 70-30 split and 85-15 split in the training set. When running these models on the testing set, we got accuracies on Kaggle of 0.5959 and 0.5952, respectively.

We also tried an elastic-net regression using all 15,120 observations in the training set with an alpha parameter of 0.5. It earned an accuracy of 0.5950 on the testing set.

Artificial Neural Networks

We first modeled the data in a feed-forward neural network with a single layer of nodes. Later, we created deep learning models where we added more layers of hidden nodes.

Single Hidden Layer Using the nnet Package

The nnet R library allows the construction of feed-forward single-layer neural networks. After scaling our variables and doing a basic run, we decided to tune the nnet function on the following parameters:

  - Number of nodes
  - Number of iterations
  - Maximum number of weights (since this depends on the number of nodes)

We experimented to find the optimal number of nodes for our hidden layer by checking accuracy over various node counts.

We finally chose a 54-100-7 neural network on our training data, achieving 0.52113 accuracy with 300 maximum iterations and 0.52432 accuracy with 500 maximum iterations.
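A shrunken version of that fit is sketched below on toy data (a smaller hidden layer so it runs quickly; the real model used size = 100 on the 54 scaled predictors, which is why MaxNWts must be raised above its default of 1000):

```r
library(nnet)
set.seed(42)

# Toy stand-in for the scaled predictors and the 7 cover-type classes
train_x <- matrix(rnorm(350 * 10), ncol = 10)
train_y <- factor(sample(1:7, 350, replace = TRUE))

# Single hidden layer, softmax output over the 7 classes;
# class.ind() converts the factor to a 0/1 indicator matrix
fit <- nnet(train_x, class.ind(train_y), size = 20, softmax = TRUE,
            maxit = 100, MaxNWts = 10000, trace = FALSE)

# Predicted class index (1-7) = column of the largest softmax probability
pred <- max.col(predict(fit, train_x))
```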

Deep Learning using H2o

For adding multiple hidden layers, we used the deeplearning function from the H2o package. H2o is an R package that runs parallel-distributed machine learning algorithms to improve computational speed. Later, we also used H2o’s Gradient Boosting Machine (GBM) algorithm to achieve our second-best accuracy rate.

Some features of H2o include:

  - In-memory compression techniques to handle large datasets
  - Automatic standardization
  - Randomized grid search
  - Fast algorithms deployed in Java
  - Anomaly detection to identify outliers

When we first ran a basic model, our validation results were poor, so we decided to experiment with varying the default parameters using a grid search. We tried variations of one to three hidden layers with between 30 and 200 nodes each. We also varied the input dropout ratio, the fraction of training features to omit to improve generalizability. The next parameter was the learning rate, or how much the network's weights adjust at each step. Finally, we tuned the rate annealing parameter, which decays the learning rate over time and reduces the likelihood of the optimizer settling into poor local minima. H2o allows for “hyper-parameter” tuning in the following format:

```r
hyper_params <- list(
  hidden = list(c(30), c(50), c(70), c(100), c(130), c(160), c(200), c(250),
                c(30,60), c(60,90), c(90,120), c(120,150),
                c(30,60,90), c(60,90,120), c(90,120,160)),
  input_dropout_ratio = c(0, 0.05),
  rate = c(0.01, 0.02),
  rate_annealing = c(1e-8, 1e-7, 1e-6)
)
```

From our first tuning run, the neural network had an accuracy of 0.60186 with a 54-120-150-7 network.

After returning to the model, we decided to experiment with a wider range of the tuning parameters. Our best model was a 54-500-800-7 network with a learning rate of 0.02, rate annealing of 1e-05, input dropout ratio of 0, and an accuracy of 0.674.

Notable Prior Studies

The earliest use of the dataset was in a doctoral dissertation: Jock Blackard and Denis Dean at Colorado State University used linear discriminant analysis to obtain 58% accuracy and an artificial neural network to achieve 70% accuracy. The paper concludes that although an ANN has its drawbacks, it is still more viable than traditional methods, by which we think the authors refer to logistic regression, LDA, and support vector classifiers.

In 2007, Jain & Minz (Journal of Indian Agricultural Statistics) applied rough set theory as an alternative to machine learning approaches, achieving an equally good 78% accuracy on the forest cover dataset while using fewer attributes (only 29). This is especially relevant to real-world classification applications in remote sensing involving large datasets.

In 2004, in “Systems, Man and Cybernetics,” an IEEE International Conference publication, Xiaomei Liu and Kevin W. Bowyer of the University of Notre Dame used the forest cover dataset to conjecture something interesting. In their own words:

“Based on some experiments with the Forest Cover Type Data Set, we conjectured a class of problems that are extremely difficult for FFBP NNs but relatively simple for DTs. The features of such problems are that some minority classes are surrounded and overlapped by some majority classes. We generated synthetic data sets with the above description. In our experiments on the synthetic data sets, we varied the size and the number of the islands for the minority class. We also varied the ratio of the minority class and the majority class for each island. We tried a plain feed forward back propagation NN, a Rprop NN, and a cascade correlation NN. With all of these experiments, the DT performed better than each kind of FFBP NN”

Based on these findings, we decided to move on from neural networks in favor of tree-based methods.

## Tree Based Methods

### Random Forest

#### Boosting Using XGBoost

#### Boosting Using H2o’s Gradient Boosting Algorithm

#### Extremely Randomized Trees

Support Vector Machines ?

Ensembling

Results Summary

| Classifier | Feature Engineering | Parameters | Accuracy | Rank |
|---|---|---|---|---|
| Logistic Regression - Ridge | No | 70-30 split | 0.55696 | 1,489 |
| Logistic Regression - Ridge | No | 85-15 split | 0.59550 | 1,414 |
| Logistic Regression - Lasso | No | 70-30 split, best lambda | 0.59594 | 1,411 |
| Logistic Regression - Lasso | No | 85-15 split, best lambda | 0.59526 | 1,415 |
| Logistic Regression - Lasso | No | 85-15 split, lowest 50th percentile of lambdas, mode of predictions | 0.59443 | 1,418 |
| Logistic Regression - Elastic-net | No | 70-30 split | 0.59588 | 1,412 |
| Logistic Regression - Elastic-net | No | 85-15 split | 0.59520 | 1,415 |
| Logistic Regression - Elastic-net | No | whole training set | 0.59496 | 1,417 |
| Random Forest | No | 80-20 split | 0.68932 | 1,286 |
| Random Forest | No | whole training set | 0.70748 | 1,210 |
| Random Forest | No | ntree=500, mtry=13 | 0.75168 | 792 |
| Random Forest | No | ntree=500, mtry=17 | 0.75400 | 672 |
| Random Forest | Yes | ntree=500, mtry=13 | 0.69115 | 1,282 |
| Random Forest | Yes | ntree=500, mtry=20 | 0.71929 | 1,221 |
| Random Forest | Yes | ntree=500, mtry=30 | 0.71831 | |
| Neural Network | No | nnet package, whole training set, maxit=300 | 0.50867 | |
| Neural Network | No | nnet package, whole training set, maxit=500 | 0.51487 | |
| Neural Network | No | nnet package, 80% of training set, maxit=300 | 0.52113 | |
| Neural Network | No | nnet package, 80% of training set, maxit=500 | 0.52432 | |
| Neural Network - Deep Learning | No | H2o package | 0.59460 | |
| Neural Network - Deep Learning | No | H2o package, 54-100-120-7 | 0.60186 | |
| Neural Network - Deep Learning | No | H2o package, 54-120-150-7 | 0.60186 | |
| Neural Network - Deep Learning | No | H2o package, 54-500-800-7, rate=0.02, rate_annealing=1e-05, rate_decay=1 | 0.67428 | |
| XGBoost | No | nround=50 | 0.70033 | 1,258 |
| XGBoost | No | nround=104 | 0.73438 | 991 |
| XGBoost | Yes | after 800 runs | 0.72949 | |
| XGBoost | Yes | CV run, sequentially choosing the 62 variables by importance, optimal parameter set | 0.72656 | |
| GBM - H2o | No | ntrees=250, max_depth=18, min_rows=10, learn_rate=0.1 | 0.76640 | 467 |
| GBM - H2o | No | ntrees=200, max_depth=20, min_rows=10, learn_rate=0.1 | 0.76119 | 521 |
| GBM - H2o | No | ntrees=200, max_depth=20, min_rows=10, learn_rate=0.1, CV 80/20 | 0.74509 | 911 |
| GBM - H2o | No | w/o stopping_rounds=2, stopping_tolerance=0.01, score_each_iteration=T, 10-fold | 0.76586 | 479 |
| GBM - H2o | No | w/o stopping_rounds=2, stopping_tolerance=0.01, score_each_iteration=T, 15-fold | 0.78052 | 356 |
| GBM - H2o | No | reduced row sampling rate=0.4, column sampling rate=0.3 (roughly sqrt(n)/n), 9-fold | 0.78052 | |
| ExtraTrees | No | mtry=13, numRandomCuts=2, nodesize=3, numThreads=3, ntree=50 | 0.79220 | 224 |
| ExtraTrees | No | mtry=10, numRandomCuts=5, nodesize=2, numThreads=3, ntree=200 | 0.79247 | 224 |
| ExtraTrees | Yes | mtry=13, numRandomCuts=4, nodesize=3, numThreads=3, ntree=700 | 0.78739 | 308 |
| ExtraTrees | Yes | tuned: mtry=56, numRandomCuts=4, nodesize=3, numThreads=3, ntree=200 | 0.75700 | 613 |
| Ensemble | No? | extratrees_submission_ensemblegbm.csv | 0.78808 | ??? |
| Ensemble 1 | ? | basic voting of RF, GBM, ExtraTrees, and XGBoost | 0.76 | ? |
| Ensemble 2 | ? | feeding ExtraTrees to H2o XGB | 0.79182 | ? |
| Ensemble 3 | ? | feeding ExtraTrees to H2o XGB, then to XGB again | 0.79182 | ? |


Conclusion

“The general characteristics of this class of problems is that some minority classes are distributed in small regions that are surrounded and overlapped with majority classes. … When there is a minority class that is distributed in small isolated regions of feature space, the BP learning mechanism seems to have great trouble ‘finding’ the regions.”

References: